Himalayan Expedition Deaths

Author

Jayjit Das

This dataset lets us explore feature engineering procedures, such as resampling to address class imbalance (there are far more survivors than fatalities) and imputing missing data, particularly for expedition members with incomplete information such as age.

Goal: To predict the probability of a Himalayan expedition member surviving or dying.

Code
library(tidyverse)
library(tidymodels)
library(skimr)
library(plotly)
library(knitr)

Exploratory data analysis

Loading and exploring the dataset

Code
members <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
skim(members)
Data summary
Name members
Number of rows 76519
Number of columns 21
_______________________
Column type frequency:
character 10
logical 6
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
expedition_id 0 1.00 9 9 0 10350 0
member_id 0 1.00 12 12 0 76518 0
peak_id 0 1.00 4 4 0 391 0
peak_name 15 1.00 4 25 0 390 0
season 0 1.00 6 7 0 5 0
sex 2 1.00 1 1 0 2 0
citizenship 10 1.00 2 23 0 212 0
expedition_role 21 1.00 4 25 0 524 0
death_cause 75413 0.01 3 27 0 12 0
injury_type 74807 0.02 3 27 0 11 0

Variable type: logical

skim_variable n_missing complete_rate mean count
hired 0 1 0.21 FAL: 60788, TRU: 15731
success 0 1 0.38 FAL: 47320, TRU: 29199
solo 0 1 0.00 FAL: 76398, TRU: 121
oxygen_used 0 1 0.24 FAL: 58286, TRU: 18233
died 0 1 0.01 FAL: 75413, TRU: 1106
injured 0 1 0.02 FAL: 74806, TRU: 1713

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2000.36 14.78 1905 1991 2004 2012 2019 ▁▁▁▃▇
age 3497 0.95 37.33 10.40 7 29 36 44 85 ▁▇▅▁▁
highpoint_metres 21833 0.71 7470.68 1040.06 3800 6700 7400 8400 8850 ▁▁▆▃▇
death_height_metres 75451 0.01 6592.85 1308.19 400 5800 6600 7550 8830 ▁▁▂▇▆
injury_height_metres 75510 0.01 7049.91 1214.24 400 6200 7100 8000 8880 ▁▁▂▇▇

How has the rate of expedition success and member death changed over time?

Code
plot1 <- members %>%
  group_by(year = 10 * (year %/% 10)) %>%
  summarise(
    died = mean(died),
    success = mean(success)
  ) %>%
  pivot_longer(died:success, names_to = "outcome", values_to = "percent") %>%
  ggplot(aes(year, percent, color = outcome)) +
  geom_line(alpha = 0.7, linewidth = 1.5) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(x = "year", y = "% of expedition members", color = NULL)

plotly::ggplotly(plot1) 

Is there a relationship between the expedition member’s age and success of the expedition or death? We can use the same code but just switch out year for age.

Code
plot2 <- members %>%
  group_by(age = 10 * (age %/% 10)) %>%
  summarise(
    died = mean(died),
    success = mean(success)
  ) %>%
  pivot_longer(died:success, names_to = "outcome", values_to = "percent") %>%
  ggplot(aes(age, percent, color = outcome)) +
  geom_line(alpha = 0.7, linewidth = 1.5) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(x = "age", y = "% of expedition members", color = NULL)

plotly::ggplotly(plot2) 

Are people more likely to die on unsuccessful expeditions?

Code
members %>%
  count(success, died) %>%
  group_by(success) %>%
  mutate(percent = scales::percent(n / sum(n))) %>%
  kable(
    col.names = c("Expedition success", "Died", "Number of people", "% of people"),
    align = "llrr"
  )
Expedition success Died Number of people % of people
FALSE FALSE 46452 98%
FALSE TRUE 868 2%
TRUE FALSE 28961 99%
TRUE TRUE 238 1%

We can use a similar approach to see how different the rates of death are on different peaks in the Himalayas.

Code
members %>%
  filter(!is.na(peak_name)) %>%
  mutate(peak_name = fct_lump(peak_name, prop = 0.05)) %>%
  count(peak_name, died) %>%
  group_by(peak_name) %>%
  mutate(percent = scales::percent(n / sum(n))) %>%
  kable(
    col.names = c("Peak", "Died", "Number of people", "% of people"),
    align = "llrr"
  )
Peak Died Number of people % of people
Ama Dablam FALSE 8374 100%
Ama Dablam TRUE 32 0%
Cho Oyu FALSE 8838 99%
Cho Oyu TRUE 52 1%
Everest FALSE 21507 99%
Everest TRUE 306 1%
Manaslu FALSE 4508 98%
Manaslu TRUE 85 2%
Other FALSE 32171 98%
Other TRUE 631 2%

Let’s make one last exploratory plot and look at seasons. How much difference is there in survival across the four seasons?

Code
plot3 <- members %>%
  filter(season != "Unknown") %>%
  count(season, died) %>%
  group_by(season) %>%
  mutate(
    percent = n / sum(n),
    died = case_when(
      died ~ "Died",
      TRUE ~ "Did not die"
    )
  ) %>%
  ggplot(aes(season, percent, fill = season)) +
  geom_col(alpha = 0.8, position = "dodge", show.legend = FALSE) +
  scale_y_continuous(labels = scales::percent_format()) +
  facet_wrap(~died, scales = "free") +
  labs(x = NULL, y = "% of expedition members")

plotly::ggplotly(plot3)

Let’s now create the dataset that we’ll use for modeling by filtering on some of the variables and transforming some variables to be factors. There are still lots of NA values for age, but we are going to impute those.

Code
members_df <- members %>%
  filter(season != "Unknown", !is.na(sex), !is.na(citizenship)) %>%
  select(peak_id, year, season, sex, age, citizenship, hired, success, died) %>%
  mutate(died = case_when(
    died ~ "died",
    TRUE ~ "survived"
  )) %>%
  mutate_if(is.character, factor) %>%
  mutate_if(is.logical, as.integer)

members_df
# A tibble: 76,507 × 9
   peak_id  year season sex     age citizenship hired success died    
   <fct>   <dbl> <fct>  <fct> <dbl> <fct>       <int>   <int> <fct>   
 1 AMAD     1978 Autumn M        40 France          0       0 survived
 2 AMAD     1978 Autumn M        41 France          0       0 survived
 3 AMAD     1978 Autumn M        27 France          0       0 survived
 4 AMAD     1978 Autumn M        40 France          0       0 survived
 5 AMAD     1978 Autumn M        34 France          0       0 survived
 6 AMAD     1978 Autumn M        25 France          0       0 survived
 7 AMAD     1978 Autumn M        41 France          0       0 survived
 8 AMAD     1978 Autumn M        29 France          0       0 survived
 9 AMAD     1979 Spring M        35 USA             0       0 survived
10 AMAD     1979 Spring M        37 W Germany       0       1 survived
# ℹ 76,497 more rows

Building a model

We’ll split, or “spend”, our data to generate training and testing sets.

Code
set.seed(111)
members_split <- initial_split(members_df, strata = died)
members_train <- training(members_split)
members_test <- testing(members_split)
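
Because we stratified the split on died, the training and testing sets should contain almost identical (and very small) proportions of deaths. A quick sanity check, sketched here using the objects created above:

```r
# Compare the proportion of deaths in each set; stratified sampling
# should keep the two proportions nearly identical (~1.4% died)
bind_rows(
  members_train %>% count(died) %>% mutate(set = "train"),
  members_test %>% count(died) %>% mutate(set = "test")
) %>%
  group_by(set) %>%
  mutate(prop = n / sum(n))
```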

We use resampling to evaluate model performance, so let’s get those resampled sets ready.

Code
set.seed(111)
members_folds <- vfold_cv(members_train, strata = died)
members_folds
#  10-fold cross-validation using stratification 
# A tibble: 10 × 2
   splits               id    
   <list>               <chr> 
 1 <split [51642/5738]> Fold01
 2 <split [51642/5738]> Fold02
 3 <split [51642/5738]> Fold03
 4 <split [51642/5738]> Fold04
 5 <split [51642/5738]> Fold05
 6 <split [51642/5738]> Fold06
 7 <split [51642/5738]> Fold07
 8 <split [51642/5738]> Fold08
 9 <split [51642/5738]> Fold09
10 <split [51642/5738]> Fold10

Next we build a recipe for data preprocessing.

  • First, we must tell the recipe() what our model is going to be (using a formula here) and what our training data is.

  • Next, we impute the missing values for age using the median age in the training data set. There are more complex steps available for imputation, but we’ll stick with a straightforward option here.

  • Next, we use step_other() to collapse categorical levels for peak and citizenship. Before this step, there were hundreds of values in each variable.

  • After this, we can create indicator variables for the non-numeric, categorical values, except for the outcome died which we need to keep as a factor.

  • Finally, there are many more people who survived their expedition than who died (thankfully) so we will use step_smote() to balance the classes.

The object members_rec is a recipe that has not been trained on data yet (for example, which categorical levels should be collapsed has not been calculated).

Code
library(themis)

members_rec <- recipe(died ~ ., data = members_train) %>%
  step_impute_median(age) %>%
  step_other(peak_id, citizenship) %>%
  step_dummy(all_nominal(), -died) %>%
  step_smote(died)

members_rec
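
To see what the trained recipe actually produces, we can prep() it (which estimates the required parameters from the training data) and then bake() it with new_data = NULL to return the processed training set. This is just an exploratory sketch; the workflow below handles prepping automatically. Since step_smote() runs last, the two outcome classes should now be roughly balanced:

```r
# Train (prep) the recipe, then extract the processed training data.
# After SMOTE, counts for "died" and "survived" should be about equal.
members_prep <- prep(members_rec)
bake(members_prep, new_data = NULL) %>%
  count(died)
```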

Comparing two different models: a logistic regression model and a random forest model.

Code
glm_spec <- logistic_reg() %>%
  set_engine("glm")

glm_spec
Logistic Regression Model Specification (classification)

Computational engine: glm 
Code
rf_spec <- rand_forest(trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("ranger")

rf_spec
Random Forest Model Specification (classification)

Main Arguments:
  trees = 1000

Computational engine: ranger 

Putting together a tidymodels workflow()

Code
members_wf <- workflow() %>%
  add_recipe(members_rec)

members_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: None

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_impute_median()
• step_other()
• step_dummy()
• step_smote()

Now we can add a model and fit it to each of the resamples. First, we fit the logistic regression model, using a non-default metric set so we can add sensitivity and specificity.

Code
members_metrics <- metric_set(roc_auc, accuracy, sensitivity, specificity)

glm_rs <- members_wf %>%
  add_model(glm_spec) %>%
  fit_resamples(
    resamples = members_folds,
    metrics = members_metrics,
    control = control_resamples(save_pred = TRUE)
  )

glm_rs
# Resampling results
# 10-fold cross-validation using stratification 
# A tibble: 10 × 5
   splits               id     .metrics         .notes           .predictions
   <list>               <chr>  <list>           <list>           <list>      
 1 <split [51642/5738]> Fold01 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>    
 2 <split [51642/5738]> Fold02 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>    
 3 <split [51642/5738]> Fold03 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>    
 4 <split [51642/5738]> Fold04 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>    
 5 <split [51642/5738]> Fold05 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>    
 6 <split [51642/5738]> Fold06 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>    
 7 <split [51642/5738]> Fold07 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>    
 8 <split [51642/5738]> Fold08 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>    
 9 <split [51642/5738]> Fold09 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>    
10 <split [51642/5738]> Fold10 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>    

Now fitting the random forest model.

Code
rf_rs <- members_wf %>%
  add_model(rf_spec) %>%
  fit_resamples(
    resamples = members_folds,
    metrics = members_metrics,
    control = control_resamples(save_pred = TRUE)
  ) 
Code
rf_rs
# Resampling results
# 10-fold cross-validation using stratification 
# A tibble: 10 × 5
   splits               id     .metrics         .notes           .predictions
   <list>               <chr>  <list>           <list>           <list>      
 1 <split [51642/5738]> Fold01 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>    
 2 <split [51642/5738]> Fold02 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>    
 3 <split [51642/5738]> Fold03 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>    
 4 <split [51642/5738]> Fold04 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>    
 5 <split [51642/5738]> Fold05 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>    
 6 <split [51642/5738]> Fold06 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>    
 7 <split [51642/5738]> Fold07 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>    
 8 <split [51642/5738]> Fold08 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>    
 9 <split [51642/5738]> Fold09 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>    
10 <split [51642/5738]> Fold10 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>    

We have fit each of our candidate models to our resampled training set!

Evaluating the models

Let’s evaluate our models.

Code
collect_metrics(glm_rs)
# A tibble: 4 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.615    10 0.00264 Preprocessor1_Model1
2 roc_auc     binary     0.703    10 0.0117  Preprocessor1_Model1
3 sensitivity binary     0.684    10 0.0158  Preprocessor1_Model1
4 specificity binary     0.614    10 0.00267 Preprocessor1_Model1

Well, this is middling, but at least it is mostly consistent across the positive and negative classes. The function collect_metrics() extracts and formats the .metrics column from resampling results like the ones we have here.
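
By default collect_metrics() averages over the ten folds; if we want to inspect fold-to-fold variation, we can ask it not to summarize (a sketch using tune’s summarize argument):

```r
# One row per metric per fold, instead of the averaged values
collect_metrics(glm_rs, summarize = FALSE) %>%
  filter(.metric == "roc_auc")
```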

Code
collect_metrics(rf_rs)
# A tibble: 4 × 6
  .metric     .estimator  mean     n  std_err .config             
  <chr>       <chr>      <dbl> <int>    <dbl> <chr>               
1 accuracy    binary     0.974    10 0.000757 Preprocessor1_Model1
2 roc_auc     binary     0.756    10 0.0131   Preprocessor1_Model1
3 sensitivity binary     0.152    10 0.00992  Preprocessor1_Model1
4 specificity binary     0.986    10 0.000690 Preprocessor1_Model1

The accuracy is great, but the sensitivity is much worse than the logistic regression model’s. The random forest model has not learned to recognize both classes, even with our oversampling strategy. Let’s dig deeper into how these models are doing. For example, how are they predicting the two classes?

Code
glm_rs %>%
  conf_mat_resampled()
# A tibble: 4 × 3
  Prediction Truth      Freq
  <fct>      <fct>     <dbl>
1 died       died       56  
2 died       survived 2182  
3 survived   died       25.9
4 survived   survived 3474. 
Code
rf_rs %>%
  conf_mat_resampled()
# A tibble: 4 × 3
  Prediction Truth      Freq
  <fct>      <fct>     <dbl>
1 died       died       12.5
2 died       survived   80.3
3 survived   died       69.4
4 survived   survived 5576. 

The random forest model is quite bad at identifying which expedition members died, while the logistic regression model does about the same for both classes.
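
If we prefer to look at these averaged confusion matrices as plots, conf_mat_resampled() can return a conf_mat object instead of a tibble (via its tidy argument), which has an autoplot() method; a sketch:

```r
# Heatmap of the averaged resampled confusion matrix for the random forest
rf_rs %>%
  conf_mat_resampled(tidy = FALSE) %>%
  autoplot(type = "heatmap")
```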

Visualizing the ROC curve

Code
plot5 <- glm_rs %>%
  collect_predictions() %>%
  group_by(id) %>%
  roc_curve(died, .pred_died) %>%
  ggplot(aes(1 - specificity, sensitivity, color = id)) +
  geom_abline(lty = 2, color = "gray80", linewidth = 1.5) +
  geom_path(show.legend = FALSE, alpha = 0.6, linewidth = 1.2) +
  coord_equal()

plotly::ggplotly(plot5)
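
If we want a single pooled curve across all resamples rather than one curve per fold, yardstick’s roc_curve() output also has an autoplot() method; a sketch:

```r
# One overall ROC curve across all resampled predictions
glm_rs %>%
  collect_predictions() %>%
  roc_curve(died, .pred_died) %>%
  autoplot()
```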

It is finally time for us to return to the testing set. Notice that we have not used the testing set yet during this whole analysis; to compare and assess models we used resamples of the training set. Let’s fit one more time to the training data and evaluate on the testing data using the function last_fit().

Code
members_final <- members_wf %>%
  add_model(glm_spec) %>%
  last_fit(members_split)

members_final
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits                id             .metrics .notes   .predictions .workflow 
  <list>                <chr>          <list>   <list>   <list>       <list>    
1 <split [57380/19127]> train/test sp… <tibble> <tibble> <tibble>     <workflow>

The metrics and predictions here are on the testing data.

Code
collect_metrics(members_final)
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.615 Preprocessor1_Model1
2 roc_auc  binary         0.689 Preprocessor1_Model1

Confusion matrix

Code
collect_predictions(members_final) %>%
  conf_mat(died, .pred_class)
          Truth
Prediction  died survived
  died       191     7273
  survived    96    11567
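
Since accuracy and ROC AUC are the only metrics last_fit() computes by default, we can get sensitivity and specificity on the test set directly from the predictions, using yardstick’s sens() and spec() (a sketch):

```r
# Sensitivity and specificity on the held-out test set
test_preds <- collect_predictions(members_final)
bind_rows(
  sens(test_preds, died, .pred_class),
  spec(test_preds, died, .pred_class)
)
```

From the confusion matrix above, these should come out near 0.67 and 0.61, in line with the resampled estimates.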

Coefficients of predictors

The coefficients (which we can get out using tidy()) have been estimated using the training data. If we use exponentiate = TRUE, we have odds ratios.

Code
members_final %>%
  pull(.workflow) %>%
  pluck(1) %>%
  tidy(exponentiate = TRUE) %>%
  arrange(estimate) %>%
  kable(digits = 3)
term estimate std.error statistic p.value
(Intercept) 0.000 0.938 -53.347 0.000
peak_id_MANA 0.263 0.041 -32.916 0.000
peak_id_other 0.266 0.033 -40.502 0.000
peak_id_EVER 0.362 0.035 -29.227 0.000
sex_M 0.460 0.030 -25.612 0.000
hired 0.488 0.056 -12.923 0.000
citizenship_other 0.587 0.032 -16.847 0.000
citizenship_Japan 0.706 0.037 -9.309 0.000
citizenship_Nepal 0.875 0.063 -2.133 0.033
season_Winter 0.884 0.042 -2.940 0.003
season_Spring 0.951 0.016 -3.231 0.001
age 0.991 0.001 -12.958 0.000
year 1.027 0.000 55.770 0.000
peak_id_CHOY 1.101 0.041 2.339 0.019
citizenship_UK 1.773 0.045 12.780 0.000
citizenship_USA 1.815 0.044 13.604 0.000
success 2.184 0.016 48.953 0.000
season_Summer 3.689 0.080 16.289 0.000
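
As a worked example of reading these odds ratios: the estimate for year is about 1.027, meaning each additional year multiplies the odds of survival by roughly 1.027. Compounded over a decade, that is about a 30% increase in the odds:

```r
# Compounding the per-year odds ratio for survival over ten years
1.027^10
# about 1.31, i.e. ~31% higher odds of survival per decade
```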

Visualizing the result

Code
plot4 <- members_final %>%
  pull(.workflow) %>%
  pluck(1) %>%
  tidy() %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(estimate, fct_reorder(term, estimate))) +
  geom_vline(xintercept = 0, color = "gray7", lty = 2, linewidth = 1.2) +
  geom_errorbar(aes(
    xmin = estimate - std.error,
    xmax = estimate + std.error
  ),
  width = .2, color = "gray7", alpha = 0.7
  ) +
  geom_point(size = 2, color = "darkcyan") +
  labs(y = NULL, x = "Coefficient from logistic regression")

plotly::ggplotly(plot4)
  • The features with coefficients on the positive side (like climbing in summer, being on a successful expedition, or being from the UK or US) are associated with surviving.

  • The features with coefficients on the negative side (like climbing specific peaks including Everest, being one of the hired members of an expedition, or being a man) are associated with dying.

We need to interpret the model coefficients in the context of the model’s moderate predictive accuracy; it likely does not capture all the factors influencing expedition survival. Additionally, the results point to a heightened risk for native Sherpa climbers hired onto expeditions in Nepal, underscoring the dangers this group faces during mountain expeditions.

Further reading:

1) Himalayan Climbing Expeditions modelling.

2) Online platform to learn tidymodels.

3) Tidymodels in R book.